Generating Training Datasets for Training Machine Learning Based Models for Predicting Behavior of Traffic Entities for Navigating Autonomous Vehicles

ABSTRACT

A vehicle collects video data of an environment surrounding the vehicle including traffic entities, e.g., pedestrians, bicyclists, or other vehicles. The captured video data is sampled and the sampled video frames are presented to users to provide input on a traffic entity's state of mind. The system determines an attribute value that describes a statistical distribution of user responses for the traffic entity. If the attribute for a sampled video frame is within a threshold of the attribute of another video frame, the system interpolates the attribute for a third video frame between the two sampled video frames. Otherwise, the system requests further user input for a video frame captured between the two sampled video frames. The interpolated and/or user-based attributes are used to train a machine learning based model that predicts a hidden context of the traffic entity. The trained model is used for navigation of autonomous vehicles.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 62/929,806, filed Nov. 2, 2019, which is incorporated by reference in its entirety.

TECHNICAL FIELD

The disclosure relates in general to generating datasets for training machine learning based models that can be used in navigating autonomous vehicles, and more specifically to generating dense training datasets for training machine learning based models that predict hidden contexts of traffic entities.

BACKGROUND

An autonomous vehicle uses different types of sensors to receive input describing the surroundings (or environment) of the autonomous vehicle while driving through traffic. For example, an autonomous vehicle may perceive the surroundings using camera images and lidar scans. The autonomous vehicle determines whether an object in the surroundings is stationary, for example, a building or a tree, or non-stationary, for example, a pedestrian or a vehicle. The autonomous vehicle system predicts the motion of non-stationary objects to make sure that the autonomous vehicle is able to navigate through non-stationary obstacles in the traffic.

Conventional systems predict motion of non-stationary objects using kinematics. For example, autonomous vehicles may rely on methods that make decisions on how to control the vehicles by predicting “motion vectors” of people near the vehicles. This is accomplished by collecting data of a person's current and past movements, determining a motion vector of the person at a current time based on these movements, and extrapolating a future motion vector representing the person's predicted motion at a future time based on the current motion vector. However, these techniques fail to predict the motion of certain non-stationary objects, for example, pedestrians and bicyclists. For example, if the autonomous vehicle detects a pedestrian standing on a street corner, the motion of the pedestrian does not help predict whether the pedestrian will cross the street or remain standing on the street corner. Similarly, if the autonomous vehicle detects a bicyclist in a lane, the current motion of the bicycle does not help the autonomous vehicle predict whether the bicycle will change lanes. Failure of autonomous vehicles to accurately predict the motion of non-stationary traffic objects results in unnatural movement of the autonomous vehicle, for example, the autonomous vehicle suddenly stopping due to a pedestrian moving in the road, or the autonomous vehicle continuing to wait for a person to cross a street even if the person never intends to cross the street.

Machine learning based models, for example, neural networks, are used for making various predictions needed to navigate autonomous vehicles smoothly through traffic. The quality of these machine learning based models depends on the quality and the amount of training data used for training them. Current techniques for generating training data for these machine learning based models require user input, for example, to generate labeled training datasets. However, obtaining large scale user input to build such a training dataset is time consuming and expensive.

SUMMARY

A vehicle collects video data of an environment surrounding the vehicle using sensors, for example, cameras mounted on the vehicle. The video data comprises a sequence of video frames. The driving environment includes at least one traffic entity, such as a pedestrian, bicyclist, or another vehicle. A traffic entity is associated with a hidden context, for example, a state of mind of a pedestrian indicating an intention to cross a path of the vehicle or a measure of awareness of the vehicle. The captured video data is sampled to obtain a plurality of video frames. The system annotates each sampled video frame, each annotation specifying an attribute value describing a statistical distribution of user responses associated with a traffic entity displayed in the video frame. For example, the system presents a sampled video frame to a plurality of users and receives user responses describing a hidden context associated with a traffic entity displayed in the video frame. The system aggregates the user responses to determine the attribute value describing the statistical distribution of user responses for the traffic entity.

The system further annotates other video frames as follows. If the attribute value of a first sampled video frame is within a threshold of the attribute value of a second sampled video frame, the system interpolates attribute values for a third video frame captured between the two sampled video frames. Otherwise, the system requests further user input for the third video frame captured between the two sampled video frames. The interpolated and/or user-based attribute values are incorporated into a training dataset used to train a machine learning based model that predicts hidden context associated with traffic entities. The machine learning based model is provided to an autonomous vehicle and assists with navigation of the autonomous vehicle.

In an embodiment, the system identifies a second pair of video frames from the plurality of video frames. The second pair of video frames comprises a fourth video frame and a fifth video frame. The time of capture of the fourth video frame and the time of capture of the fifth video frame are separated by a second time interval. The system compares a fourth attribute value specified by the annotations of the fourth video frame and a fifth attribute value specified by the annotations of the fifth video frame. If the fourth attribute value differs from the fifth attribute value by more than a threshold amount, the system identifies a sixth video frame having a time of capture within the second time interval and sends the identified sixth video frame to a plurality of users for annotation. The system receives user responses describing a hidden context associated with a traffic entity displayed in the identified video frame. The system annotates the sixth video frame based on the user responses received from the plurality of users. The sixth video frame may be used for training a machine learning based model. The process allows the system to select the video frames that are sent to users for annotation, so that video frames that have repetitive information are less likely to be sent to users for annotation, and video frames that are more likely to improve the machine learning based model via training are sent to users for annotation. Since the number of video frames captured can be very large, the techniques disclosed allow effective and efficient annotation of video frames and efficient utilization of the available annotation resources, so that effective training data is generated for training machine learning based models.
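For illustration only, the following Python sketch shows one way the selection logic described above could be organized. The names (AnnotatedFrame, request_user_annotations, fill_between) and the scalar attribute representation are hypothetical assumptions, not the disclosed implementation.

```python
# Minimal sketch of the annotate-or-interpolate decision described above.
# All names here are hypothetical illustrations.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotatedFrame:
    time: float                        # time of capture, in seconds
    attribute: Optional[float] = None  # aggregated user-response attribute

def request_user_annotations(frame: AnnotatedFrame) -> float:
    """Placeholder: present the frame to users and aggregate their responses."""
    raise NotImplementedError

def fill_between(first: AnnotatedFrame, second: AnnotatedFrame,
                 middle: AnnotatedFrame, threshold: float) -> None:
    """Annotate `middle`, a frame captured between `first` and `second`."""
    if abs(first.attribute - second.attribute) <= threshold:
        # Responses changed little between the sampled frames: interpolate.
        frac = (middle.time - first.time) / (second.time - first.time)
        middle.attribute = first.attribute + frac * (second.attribute - first.attribute)
    else:
        # Responses changed substantially: ask users to annotate this frame.
        middle.attribute = request_user_annotations(middle)
```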

BRIEF DESCRIPTION OF DRAWINGS

The disclosed embodiments have other advantages and features which will be more readily apparent from the detailed description, the appended claims, and the accompanying figures (or drawings). A brief introduction of the figures is below.

FIG. 1 shows an overall system environment illustrating a system that predicts a traffic entity's state of mind based on user input, in accordance with one or more embodiments.

FIG. 2 is a flowchart illustrating a method for training, with user input, a machine learning model that predicts a traffic entity's state of mind, in accordance with one or more embodiments.

FIGS. 3A-3B are flowcharts illustrating a method for generating, by interpolation, a dense training data set for training a machine learning model that predicts a traffic entity's state of mind, in accordance with one or more embodiments.

FIGS. 4A-4D show example statistical distributions of user input about a traffic entity's state of mind, in accordance with one or more embodiments.

Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.

DETAILED DESCRIPTION

A busy driving environment may include a number of traffic entities, such as pedestrians, bicyclists, and other vehicles. For example, a pedestrian may decide to cross a busy street, and a bicyclist may ride in a bike lane alongside vehicles. The traffic entities are associated with hidden context comprising attributes. The hidden context attributes, along with external factors such as road signs, traffic lights, and other traffic entities' actions, contribute to the decisions traffic entities make with regard to their visible actions and movements. Hidden context attributes are distinct from attributes that describe the movement, for example, the motion vector of the traffic entity. Hidden context attributes describe a state of mind of a traffic entity. For example, a state of mind may include a measure of awareness of the vehicle in the mind of the traffic entity. The movement of the traffic entity is determined by the hidden context; for example, if the state of mind indicates that a pedestrian plans on crossing the street, the pedestrian is likely to move in front of the vehicle, even if the motion vector indicates otherwise. Unlike human drivers, autonomous vehicles lack the innate ability to judge intentions of traffic entities.

The system described herein presents video frames including activities of traffic entities to a set of users, who provide input on the traffic entities' behavior. The video frames may be from the point of view of the autonomous vehicle, such that the user input provides information on the traffic entity relative to the autonomous vehicle. For example, the set of users may provide responses to questions on how likely a traffic entity is to cross the path of the autonomous vehicle, the direction in which the traffic entity is traveling, and how likely it is that the traffic entity is aware of the autonomous vehicle's presence. The system annotates the video frames based on the user responses. For example, the system may determine values of one or more attributes that represent a statistical distribution of the user responses describing the hidden context associated with a traffic entity and annotate the video frame with the one or more attributes. The user responses are used to build a dataset that is used in training a plurality of machine learning based models. The machine learning based models take, as input, a set of video frames showing traffic entities, and provide output that predicts a statistical distribution of user responses regarding the traffic entities' states of mind and actions, i.e., the traffic entities' behavior. The machine learning based model may be provided to an autonomous vehicle. The autonomous vehicle may use the predicted user distribution of the traffic entities' states of mind to assist in the navigation of the autonomous vehicle. In particular, the machine learning based model may assist with multi-object tracking, path planning, motion planning, and/or other navigation tasks relevant to the autonomous vehicle.

The video frames presented to users are sampled from a sequence of video frames, such that user responses are received only for the sampled video frames and not for every frame in the sequence. In such cases, the behavior of the traffic entity may differ greatly between a first frame shown to the user and a second frame shown to the user. The system generates more data on the behavior of the traffic entity between the first frame and the second frame, either via interpolation or by seeking more user responses, to generate a denser dataset with which to train the machine learning based models.

Systems and methods for predicting user interaction with vehicles are disclosed in U.S. patent application Ser. No. 15/830,549, filed on Dec. 4, 2017, which is incorporated by reference herein in its entirety.

System Environment

FIG. 1 shows an overall system environment 100 illustrating a system that predicts hidden context describing traffic entities based on user input, in accordance with one or more embodiments. The system environment 100 includes a vehicle 102, a network 104, a server 106 which hosts a user response database 110, a client device 108, a model training system 112, and a prediction engine 114. The network 104 connects the vehicle 102 with the server 106 and the model training system 112.

The vehicle 102 may be any manual or motorized vehicle, such as a car, bus, or bicycle. In some embodiments, the vehicle 102 may be an autonomous vehicle. The vehicle 102 monitors its surrounding environment, capturing events in the surrounding environment through video data. For example, the vehicle 102 may include an image sensor or a camera that records a sequence of video frames that capture activities in the surrounding environment. Data may be collected from cameras or other sensors, including solid state lidar, rotating lidar, medium range radar, or others, mounted on the car in either a fixed or temporary capacity and oriented such that they capture images of the road ahead, behind, and/or to the side of the car. In some embodiments, the sensor data is recorded on a physical storage medium such as a compact flash drive, hard drive, solid state drive, or dedicated data logger. The video frames may include traffic entities and their actions, such as a pedestrian crossing a crosswalk in front of the vehicle 102, a bicyclist riding alongside the vehicle 102 in a bike lane, and other vehicles waiting to turn onto a cross street.

The network 104 may be any wired and/or wireless network capable of receiving sensor data collected by the vehicle 102 and distributing it to the server 106, the model training system 112, and, through the model training system 112, the prediction engine 114. In some embodiments, the network 104 may use standard communication protocols and comprise local area networks, wide area networks, or a combination thereof.

The server 106 may be any computer implemented system capable of hosting content, providing the content to users, and receiving input from users on the content. The content may include image, video, and text information. The server 106 provides the content to each of the users via the user's client device 108. The server 106 may present the content to the users and request input on the content. Users may be asked questions relating to the hidden context of a traffic entity, for example, the state of mind of an individual corresponding to a traffic entity, to which the users can respond. In some embodiments, the users may respond by ranking how likely a state of mind is, using the Likert scale. For example, questions presented to the user may relate to how likely a traffic entity is to cross the path of the autonomous vehicle, the direction in which the traffic entity is traveling, and how likely it is that the traffic entity is aware of the autonomous vehicle's presence. The server 106 derives an annotation for the video frames shown to the users, wherein the annotation describes the statistical distribution of the user responses. In an embodiment, the statistical distribution comprises a mean value and standard deviation. Other embodiments may use other measures of statistical distribution, for example, measurements of the central tendency of the distribution of scores, such as the mean, median, or mode. They could include measurements of the heterogeneity of the scores, such as variance, standard deviation, skew, kurtosis, heteroskedasticity, multimodality, or uniformness. They could also include summary statistics, like those above, calculated from implicit measurements of the responses.
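For illustration, the statistics named above could be derived from a set of Likert-scale responses as in the following sketch. The code and response values are hypothetical, not the server's actual implementation.

```python
# Sketch: aggregating hypothetical 1-5 Likert responses into an annotation.
import numpy as np
from scipy import stats

responses = np.array([2, 3, 3, 4, 2, 3, 5, 3])  # hypothetical user ratings

annotation = {
    "mean": float(np.mean(responses)),           # central tendency
    "median": float(np.median(responses)),
    "mode": int(np.bincount(responses).argmax()),
    "variance": float(np.var(responses)),        # heterogeneity
    "std": float(np.std(responses)),
    "skew": float(stats.skew(responses)),
    "kurtosis": float(stats.kurtosis(responses)),
}
print(annotation)  # e.g., mean 3.125, std approximately 0.93
```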

The user responses are stored in the user response database 110. The server 106 may be capable of receiving content and sending data via the network 104 as well. For example, the server 106 receives the content from the vehicle 102, and provides data on the user responses to the model training system 112.

Each user provides input on the content presented by the server 106 via a client device 108. The client device 108 is a computing device capable of receiving user input, as well as transmitting and/or receiving data via the network 104. In some embodiments, the client device 108 may be a computer system, such as a desktop or a laptop computer. The client device 108 may also be a mobile device that enables the user to interact with the server 106.

The model training system 112 trains machine learning based models that predict the state of mind, including intentions and behavior, of traffic entities in areas surrounding the vehicle 102. Different machine learning techniques can be used to train the machine learning model, including, but not limited to, decision tree learning, association rule learning, artificial neural network learning, convolutional neural networks, deep learning, support vector machines (SVM), cluster analysis, Bayesian algorithms, regression algorithms, instance-based algorithms, and regularization algorithms. In some embodiments, the model training system 112 may withhold portions of the training dataset (e.g., 10% or 20% of the full training dataset) and train a machine learning model on subsets of the training dataset. For example, the model training system 112 may train different machine learning models on different subsets of the training dataset for the purposes of performing cross-validation to further tune the parameters. In some embodiments, because candidate parameter values are selected based on historical datasets, the tuning of the candidate parameter values may be significantly more efficient in comparison to randomly identified (e.g., naïve parameter sweep) candidate parameter values. In other words, the model training system 112 can tune the candidate parameter values in less time and while consuming fewer computing resources.
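The withholding and cross-validation described above could look like the following sketch, which uses scikit-learn's train_test_split and KFold. The feature and label arrays are hypothetical stand-ins for video-frame features and their response-distribution annotations.

```python
# Sketch: withhold 20% of the data, then cross-validate on the remainder.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

frames = np.random.rand(100, 128)  # hypothetical per-frame feature vectors
labels = np.random.rand(100, 2)    # hypothetical (mean, std) annotations

# Withhold 20% of the full training dataset as a held-out set.
train_x, held_x, train_y, held_y = train_test_split(
    frames, labels, test_size=0.2, random_state=0)

# Train candidate models on different subsets for cross-validation.
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(train_x):
    fold_train = (train_x[tr_idx], train_y[tr_idx])
    fold_val = (train_x[val_idx], train_y[val_idx])
    # A candidate model would be trained on fold_train and scored on
    # fold_val here, and the best-scoring parameter values kept.
```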

The machine learning based models are trained using a process of progressively adjusting the parameters of the model in response to the characteristics of the images and summary statistics given to it in the training phase, to minimize the error in its predictions of the summary statistics for the training images. In one embodiment of the model training system 112, the algorithm can be a deep neural network. In this embodiment, the parameters are the weights attached to the connections between the artificial neurons comprising the network. Pixel data from an image in a training set, collated with human observer summary statistics, can serve as an input to the network. This input can be transformed according to a mathematical function by each of the artificial neurons, and then the transformed information can be transmitted from that artificial neuron to other artificial neurons in the neural network. The transmission between the first artificial neuron and the subsequent neurons can be modified by the weight parameters discussed above. In this embodiment, the neural network can be organized hierarchically such that the value of each input pixel can be transformed by independent layers (e.g., 10 to 20 layers) of artificial neurons, where the inputs for neurons at a given layer come from the previous layer, and all of the outputs for a neuron (and their associated weight parameters) go to the subsequent layer. At the end of the sequence of layers, in this embodiment, the network can produce numbers that are intended to match the human response statistics given at the input. The difference between the numbers that the network outputs and the human response statistics provided at the input comprises an error signal. An algorithm (e.g., back-propagation) can be used to assign a small portion of the responsibility for the error to each of the weight parameters in the network. The weight parameters can then be adjusted such that their estimated contribution to the overall error is reduced. This process can be repeated for each image (or for each combination of pixel data and human observer summary statistics) in the training set collected. At the end of this process the model is “trained”, which in some embodiments means that the difference between the summary statistics output by the neural network and the summary statistics calculated from the responses of the human observers is minimized.
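For illustration, the following PyTorch sketch shows one way such a training loop could look: a small convolutional network maps frame pixels to predicted summary statistics (here, a mean and standard deviation of user responses), and back-propagation adjusts the weights to reduce the prediction error. The architecture, tensor shapes, and data are illustrative assumptions only, not the disclosed model.

```python
# Sketch: train a small CNN to predict human response summary statistics.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),                  # outputs: predicted (mean, std)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                 # error between predicted and human statistics

frames = torch.rand(8, 3, 224, 224)    # hypothetical batch of video frames
targets = torch.rand(8, 2)             # hypothetical human summary statistics

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(frames), targets)  # error signal
    loss.backward()   # back-propagation assigns error to each weight
    optimizer.step()  # adjust weights to reduce estimated contribution to error
```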

Ultimately, predictions of a traffic entity's state of mind facilitate the navigation of autonomous vehicles, in particular with multi-object tracking, path planning, motion planning, and/or other navigation tasks relevant to the autonomous vehicle. The model training system 112 takes in the data on the user responses to video frames showing the activities of traffic entities, and models the statistical distribution of the user responses. In one embodiment, the model training system 112 receives image, video, and/or text information and accompanying user responses from the database 110 over the network 104. In some embodiments, the user responses may include discrete values of text or free text responses. The model training system 112 can use images, video segments, and text segments as training examples to train an algorithm, and can create labels from the accompanying user responses based on the trained algorithm. These labels indicate how the algorithm predicts the behavior of the people in the associated image, video, and/or text segments. After the labels are created, the model training system 112 provides them to the prediction engine 114.

The prediction engine 114 outputs a predicted distribution of user responses associated with a video frame. The predicted distribution of user responses may include predictions on identified traffic entities and on the states of mind of the traffic entities. The model training system 112 may train an algorithm in the prediction engine 114. The output of the prediction engine 114 is used in facilitating the navigation of autonomous vehicles. For example, the output of the prediction engine 114 may be used to determine the control signals provided to a control system of the autonomous vehicle, including the accelerator, steering, and braking system, to navigate the autonomous vehicle.

FIG. 2 is a flowchart illustrating a method 200 for training, with user input, a machine learning model that predicts a traffic entity's state of mind, in accordance with one or more embodiments.

A camera or another image sensor collects 210 video data of a driving environment. In some embodiments, the video data may be recorded from a vehicle (e.g., the vehicle 102). The video data comprises a plurality of video frames. The video frames may include traffic entities, such as pedestrians, bicyclists, and other motorized vehicles that are in the driving environment. The traffic entities may be stationary or moving in the video. The captured video data is sent to a server (e.g., the server 106) over a network (e.g., the network 104). In some embodiments, a plurality of sensors may collect sensor data, other than image data, about the driving environment around the vehicle. For example, the sensor data may include lidar and radar data, among others.

The server provides the video frames to a prediction engine (e.g., the prediction engine 114) that identifies 220 the traffic entities within the video frames. The server subsequently presents the video frames to a plurality of users, each of whom accesses the video frames via a client device (e.g., the client device 108).

The server requests and collects 230 user inputs on each of the traffic entities within the video frames. The users may provide responses as to whether the prediction engine correctly identified each of the traffic entities, and if not, correctly identify the traffic entities. The users may also provide responses on a state of mind of each of the traffic entities. For example, the user may be asked how likely the traffic entity is to cross the path of the autonomous vehicle, the direction in which the traffic entity is traveling, and how likely it is that the traffic entity is aware of the autonomous vehicle's presence, among other questions on the traffic entity's behavior and intentions. In responding to the questions on the traffic entity's state of mind, the user may rank the likelihood of occurrence of a particular state of mind. For example, the user may rank, on a scale of one to five, how likely a person is to cross the street. A plurality of user responses may be aggregated to form a distribution of user responses relating to the traffic entity's activities.

A machine learning based model is trained 240 with the user input on the traffic entities in the video frames. The machine learning model may be trained by a model training system (e.g., the model training system 112). The trained model ultimately predicts, via a prediction engine (e.g., the prediction engine 114), a distribution of user responses about a traffic entity.

Annotating Between Sampled Video Frames

The quality of a machine learning based model largely depends on the quality and amount of training data used for training the machine learning based model. The training data used by the model training system 112 is generated from user responses to video frames. While user responses for all or most video frames would be ideal, video data may be so voluminous that requesting such large scale user input is impractical, considering the cost and time to do so. The system may sample frames from a large set of video data and request user input on the sampled frames. In cases where the sampled frames are not consecutive, the system interpolates in between the sampled frames to predict the distribution of user responses. The predicted user responses add to a training data set used to train the machine learning model that predicts traffic entity intentions, as described in FIG. 2.

FIGS. 3A-3B are flowcharts illustrating a method 300 for generating, by interpolation, a dense training data set for training a machine learning model that predicts a traffic entity's state of mind, in accordance with one or more embodiments.

A server (e.g., the server 106) receives 310 a sequence of video frames capturing a driving environment. The server samples 320 the sequence of video frames to present a plurality of video frames to users, such that users do not provide input on all the video frames in the sequence. The video frames presented to the user may not be consecutive. For example, a first sampled video frame may capture a driving environment at 2:45 pm, while a second sampled video frame may capture the same driving environment on the same day at 2:50 pm, leaving a five minute time interval between the two sampled video frames. Similarly, the system may sample a fourth and a fifth video frame, separated by a second time interval. Each of the video frames includes at least one traffic entity about which the users provide input.
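A minimal sketch of the sampling in step 320, assuming a simple fixed stride; the function name and stride value are hypothetical, and other sampling policies are equally consistent with the description above.

```python
# Sketch: present every Nth frame to users; frames in between are filled in
# later by interpolation or targeted annotation requests.
def sample_frames(frames, stride):
    """Return every `stride`-th frame from the captured sequence."""
    return frames[::stride]

sampled = sample_frames(list(range(100)), stride=10)  # frames 0, 10, 20, ...
```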

Once the users are presented with the sampled video frames, the server receives 330 annotations from a set of users for each of the sampled video frames. Each annotation specifies an attribute value describing a statistical distribution of the set of user responses associated with a traffic entity in the sampled video frame. For example, an attribute value may indicate a statistical distribution of user responses indicating the likelihood, on a scale of one to five, that a traffic entity in a sampled video frame is aware of the vehicle (e.g., the vehicle 102) from which the video data was collected. The statistical distribution of user responses may be indicated by an attribute value via variance, standard deviation, interquartile range, or some combination thereof.

The system compares 340 attribute values for the first video frame and the second video frame. The first video frame and the second video frame are nonconsecutive, as mentioned above.

The system identifies 350 that the attribute value for the first video frame, i.e., the first attribute value, is within a threshold of the attribute value for the second video frame, i.e., the second attribute value. A first attribute value within the threshold of the second attribute value may indicate that the actions of the traffic entity do not differ greatly between the first video frame and the second video frame.

The system annotates 360 a third video frame captured in the time interval between the first and the second video frames. The system interpolates between the first attribute value and the second attribute value to predict attribute values for the third video frame. The annotations for the third video frame indicate a predicted statistical distribution of user responses for a traffic entity in the third video frame.

In one embodiment, the following relationship between the time interval and the first and the second attribute values may determine how the system interpolates attribute values for the third video frame.

Suppose that the third video frame is captured in the time interval [t₁, t₂], and that the system seeks to determine an attribute value at time t₃ within the time interval. The attribute value may depend on the differences between t₃, t₁, and t₂. For example, if t₃−t₁ &lt; t₂−t₃, then the system may assign an attribute value at t₃ closer to the attribute value at t₁. Similarly, if t₃−t₁ &gt; t₂−t₃, the system may assign an attribute value at t₃ closer to the attribute value at t₂. Accordingly, the system fits a linear function based on the attribute values at times t₁ and t₂ and uses the linear function to determine attribute values at various times in between t₁ and t₂. In other embodiments, the system may fit a higher order curve (for example, a parabolic curve) through attribute values corresponding to a set of time points.
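The linear case described above can be written as a small function. The following sketch assumes scalar attribute values v₁ and v₂ observed at times t₁ and t₂; the function name and example numbers are hypothetical.

```python
# Sketch: the attribute value at t3 is a time-weighted blend of the values
# observed at t1 and t2 (the linear function described above).
import math

def interpolate_attribute(t1, v1, t2, v2, t3):
    """Linearly interpolate an attribute value at t3, where t1 <= t3 <= t2."""
    frac = (t3 - t1) / (t2 - t1)  # 0 when t3 == t1, 1 when t3 == t2
    return v1 + frac * (v2 - v1)

# Example: frames annotated at t1 = 0 s (value 1.8) and t2 = 10 s (value 2.2).
# A frame captured at t3 = 2.5 s is closer to t1, so its interpolated value
# (1.9) is closer to the value at t1.
assert math.isclose(interpolate_attribute(0.0, 1.8, 10.0, 2.2, 2.5), 1.9)
```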

In some cases, the attribute values may not be within the threshold, in which case the system may seek further user responses for the video frames.

The system compares 370 attribute values for the fourth video frame and the fifth video frame. The system identifies 380 that the fourth video frame's fourth attribute value differs from the fifth attribute value by more than a threshold.

The system seeks further user responses for a sixth video frame captured in the second time interval between the fourth and fifth video frames. The sixth video frame is presented to the set of users, who provide input on a traffic entity in the video frame. Based on the user responses, the system annotates 390 the sixth video frame.

Depending on the attribute values, the system either interpolates annotations for a video frame or collects user annotations for it, thereby creating a denser dataset than that provided by the sampled user responses alone.

FIGS. 4A-4D show example statistical distributions of user input about a traffic entity 405's state of mind, in accordance with one or more embodiments. FIGS. 4A-4D show a plurality of video frames 400, 420, 440, and 460 shown to a set of users. Each of the frames shows the traffic entity 405 in a driving environment, and the users are requested to provide input on the traffic entity 405's actions and intentions. In particular, in FIGS. 4A-4D, the users are requested to provide a response as to the likelihood of the traffic entity 405 crossing a road in the driving environment that may be in the path of the vehicle capturing the video data. The user responses are scaled from one to five, indicating a likelihood ranging from very low to very high. The system aggregates a plurality of user responses to provide an attribute value that describes a statistical distribution of the users' responses for each video frame. In FIGS. 4A-4D, the attribute values may correspond to the statistical distributions of user responses shown in plots 410, 430, 450, and 470. While FIGS. 4A-4D show user input on the likelihood of the traffic entity 405 crossing the road in each corresponding video frame, the user input may also indicate a measure of awareness of a vehicle around them.

In FIG. 4A, the plot 410 describes the statistical distribution of user responses for the video frame 400. The plot 410 suggests that most users believe that the traffic entity 405 has a low likelihood of crossing the street.

In FIG. 4B, users are presented the video frame 420, in which the traffic entity 405 has moved slightly in comparison to its position in the video frame 400. The plot 430 shows that most users still believe that the traffic entity 405 has a relatively low likelihood of crossing the street.

In FIG. 4C, users are presented the video frame 440, in which the traffic entity 405 has again moved slightly in comparison to its position in the video frame 420. The plot 450 shows that more users believe that there is a likelihood of crossing.

In FIG. 4D, the video frame 460 shows the traffic entity 405 stepping onto a crosswalk. The plot 470 shows that users presented with the video frame 460 believe that there is a relatively high likelihood of the traffic entity 405 crossing the street.

The calculated attribute values for the video frames 400, 420, and 440 may be similar and/or may be within thresholds of each other, indicating that the statistical distributions of user responses to those video frames are similar. For the video frames 400, 420, and 440, this may indicate that most users believe that the traffic entity 405 is not very likely to cross the road.

In some embodiments, the video frames 400, 420, 440, and 460 may be sampled, such that only the video frames 400 and 440 are shown to the set of users. The calculated attribute value for the video frame 400 may be within a threshold of the attribute value for the video frame 440, given that the user responses are similar between the two video frames 400 and 440. The system may interpolate, as described in FIG. 3A, between the video frame 400's attribute value and the video frame 440's attribute value to generate a predicted attribute value, i.e., a predicted user distribution, for the video frame 420 captured in between the video frames 400 and 440.

In contrast, suppose the user is presented with the video frame 400 and, subsequently, the video frame 460. The attribute value calculated by the system for the video frame 460 may differ by more than a threshold from the attribute value for the video frame 400, as evidenced by the difference in the plots of the statistical distributions of user responses. While the plot 410 shows that users believe that the traffic entity 405 is relatively unlikely to cross the road in the video frame 400, the plot 470 shows that users believe that the traffic entity 405 is more likely to cross the road in the video frame 460. Accordingly, the system may request users to provide input on another video frame captured in the time interval between the capturing of the video frames 400 and 460. The system uses the set of users' responses to determine attribute values for the video frame captured in between the video frames 400 and 460.

The interpolated and/or user-based attribute values contribute to building a denser dataset of user responses on a traffic entity's state of mind. The dataset is input into a machine learning model that, in response to a video frame with a traffic entity, outputs a predicted distribution of user inputs about the traffic entity's state of mind.

Alternative Embodiments

The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter.

It is to be understood that the figures and descriptions have been simplified to illustrate elements that are relevant for a clear understanding of the present invention, while eliminating, for the purpose of clarity, many other elements found in a typical system. Those of ordinary skill in the art may recognize that other elements and/or steps are desirable and/or required in implementing the embodiments. However, because such elements and steps are well known in the art, and because they do not facilitate a better understanding of the embodiments, a discussion of such elements and steps is not provided herein. The disclosure herein is directed to all such variations and modifications to such elements and methods known to those skilled in the art.

Some portions of the above description describe the embodiments in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.

As used herein, any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

Some embodiments may be described using the expressions “coupled” and “connected” along with their derivatives. It should be understood that these terms are not intended as synonyms for each other. For example, some embodiments may be described using the term “connected” to indicate that two or more elements are in direct physical or electrical contact with each other. In another example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variation thereof are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

In addition, use of the “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the various embodiments. This description should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.

Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for generating training datasets for training machine learning based models that predict behavior of traffic entities, through the disclosed principles herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various modifications, changes and variations, which will be apparent to those skilled in the art, may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims.

What is claimed is:
1. A method comprising: receiving a sequence of video frames captured by a camera mounted on an autonomous vehicle; sampling the sequence of video frames to obtain a subset of video frames; annotating each of the subset of video frames obtained by sampling, each annotation specifying an attribute value describing a statistical distribution of user responses describing a hidden context for a traffic entity displayed in the video frame; identifying a pair of video frames from the subset of video frames, the pair of video frames comprising a first video frame and a second video frame, wherein a time of capture of the first video frame and a time of capture of the second video frame is separated by a first time interval; comparing a first attribute value associated with the first video frame and a second attribute value associated with the second video frame; responsive to the first attribute value being within a threshold of the second attribute value, annotating a third video frame from the sequence of video frames having a time of capture within the first time interval by interpolating using the first attribute value and the second attribute value; providing a training data set including the annotated subset of video frames and the third video frame for training a machine learning model, the machine learning model configured to receive an input video frame displaying a traffic entity and predict information describing the traffic entity; and providing the trained machine learning model to the autonomous vehicle to assist with navigation in traffic.
2. The method of claim 1, the generating further comprising: identifying a second pair of video frames from the subset of video frames, the second pair of video frames comprising a fourth video frame and a fifth video frame, wherein the time of capture of the fourth video frame and the time of capture of the fifth video frame is separated by a second time interval; comparing a fourth attribute value associated with the fourth video frame and a fifth attribute value associated with the fifth video frame; and responsive to the fourth attribute value being greater than the threshold of the fifth attribute value: identifying a sixth video frame from the sequence of video frames having a time of capture within the second time interval; sending the sixth video frame to a plurality of users; and annotating the sixth frame based on responses from the plurality of users, the responses describing a hidden context associated with a traffic entity displayed in the sixth video frame.
3. The method of claim 2, wherein the annotated sixth frame is included in the training data set.
4. The method of claim 1, wherein the information predicted by the machine learning based model comprises a statistical distribution of user responses describing a state of mind of a human associated with the traffic entity displayed in the input video frame.
5. The method of claim 4, wherein the state of mind of the human comprises an intention of the traffic entity to cross a path of the autonomous vehicle.
6. The method of claim 4, wherein the state of mind of the human further comprises a measure of awareness of the autonomous vehicle.
7. The method of claim 1, wherein an annotation of a video frame includes a set of parameters indicating a statistical distribution of the attribute values provided by a set of users upon being presented with the video frame.
8. The method of claim 1, wherein interpolating between the first attribute value and the second attribute value is based on the time of capture of the third video frame within the time interval.
9. A computer readable non-transitory storage medium storing instructions, the instructions when executed by a processor cause the processor to perform steps comprising: receiving a sequence of video frames captured by a camera mounted on an autonomous vehicle; sampling the sequence of video frames to obtain a subset of video frames; annotating each of the subset of video frames obtained by sampling, each annotation specifying an attribute value describing a statistical distribution of user responses describing a hidden context for a traffic entity displayed in the video frame; identifying a pair of video frames from the subset of video frames, the pair of video frames comprising a first video frame and a second video frame, wherein a time of capture of the first video frame and a time of capture of the second video frame is separated by a first time interval; comparing a first attribute value associated with the first video frame and a second attribute value associated with the second video frame; responsive to the first attribute value being within a threshold of the second attribute value, annotating a third video frame from the sequence of video frames having a time of capture within the first time interval by interpolating using the first attribute value and the second attribute value; providing a training data set including the annotated subset of video frames and the third video frame for training a machine learning model, the machine learning model configured to receive an input video frame displaying a traffic entity and predict information describing the traffic entity; and providing the trained machine learning model to the autonomous vehicle to assist with navigation in traffic.
10. The computer readable non-transitory storage medium of claim 9, wherein the generating further comprises: identifying a second pair of video frames from the subset of video frames, the second pair of video frames comprising a fourth video frame and a fifth video frame, wherein the time of capture of the fourth video frame and the time of capture of the fifth video frame is separated by a second time interval; comparing a fourth attribute value associated with the fourth video frame and a fifth attribute value associated with the fifth video frame; and responsive to the fourth attribute value being greater than the threshold of the fifth attribute value: identifying a sixth video frame from the sequence of video frames having a time of capture within the second time interval; sending the sixth video frame to a plurality of users; and annotating the sixth frame based on responses from the plurality of users, the responses describing a hidden context associated with a traffic entity displayed in the sixth video frame.
11. The computer readable non-transitory storage medium of claim 10, wherein the annotated sixth frame is included in the training data set.
12. The computer readable non-transitory storage medium of claim 9, wherein the information predicted by the machine learning based model comprises a statistical distribution of user responses describing a state of mind of a human associated with the traffic entity displayed in the input video frame.
13. The computer readable non-transitory storage medium of claim 12, wherein the state of mind of the human comprises an intention of the traffic entity to cross a path of the autonomous vehicle.
14. The computer readable non-transitory storage medium of claim 12, wherein the state of mind of the human further comprises a measure of awareness of the autonomous vehicle.
15. The computer readable non-transitory storage medium of claim 9, wherein an annotation of a video frame includes a set of parameters indicating a statistical distribution of the attribute values provided by a set of users upon being presented with the video frame.
16. The computer readable non-transitory storage medium of claim 9, wherein interpolating between the first attribute value and the second attribute value is based on the time of capture of the third video frame within the time interval.
17. A computer implemented system comprising: a computer processor; a computer readable non-transitory storage medium storing instructions thereon, the instructions when executed by a processor cause the processor to perform steps of: receiving a sequence of video frames captured by a camera mounted on an autonomous vehicle; sampling the sequence of video frames to obtain a subset of video frames; annotating each of the subset of video frames obtained by sampling, each annotation specifying an attribute value describing a statistical distribution of user responses describing a hidden context for a traffic entity displayed in the video frame; identifying a pair of video frames from the subset of video frames, the pair of video frames comprising a first video frame and a second video frame, wherein a time of capture of the first video frame and a time of capture of the second video frame is separated by a first time interval; comparing a first attribute value associated with the first video frame and a second attribute value associated with the second video frame; responsive to the first attribute value being within a threshold of the second attribute value, annotating a third video frame from the sequence of video frames having a time of capture within the first time interval by interpolating using the first attribute value and the second attribute value; providing a training data set including the annotated subset of video frames and the third video frame for training a machine learning model, the machine learning model configured to receive an input video frame displaying a traffic entity and predict information describing the traffic entity; and providing the trained machine learning model to the autonomous vehicle to assist with navigation in traffic.
18. The computer system of claim 17, wherein the generating further comprises: identifying a second pair of video frames from the subset of video frames, the second pair of video frames comprising a fourth video frame and a fifth video frame, wherein the time of capture of the fourth video frame and the time of capture of the fifth video frame is separated by a second time interval; comparing a fourth attribute value associated with the fourth video frame and a fifth attribute value associated with the fifth video frame; and responsive to the fourth attribute value being greater than the threshold of the fifth attribute value: identifying a sixth video frame from the sequence of video frames having a time of capture within the second time interval; sending the sixth video frame to a plurality of users; and annotating the sixth frame based on responses from the plurality of users, the responses describing a hidden context associated with a traffic entity displayed in the sixth video frame.
19. The computer system of claim 18, wherein the annotated sixth frame is included in the training data set.
20. The computer system of claim 17, wherein interpolating between the first attribute value and the second attribute value is based on the time of capture of the third video frame within the time interval.